Optical Character Recognition: Classification of Handwritten Digits and Computer Fonts
Author
Abstract
Optical character recognition (OCR) is an important application of machine learning in which an algorithm is trained on a data set of known letters/digits and learns to classify new letters/digits accurately. A variety of algorithms have shown excellent accuracy on the problem of handwritten digits; four of them are examined here. Additionally, we attempt to extend these techniques to the harder problem of classifying characters written in different fonts, and achieve error rates of ~4% and ~7% on these two data sets, respectively, using support vector machines. Finally, we implement PCA to see how the algorithms fare in a lower-dimensional space, and find the interesting result that the data set that is more difficult to classify (computer fonts) can be classified accurately using a lower-dimensional projection space.

Data Sets

For this project, we used two data sets: a subset of the MNIST database of handwritten digits and the notMNIST database of characters rendered in various computer fonts (the kind of input one might obtain when trying to read text from an image). Each set contained 10 distinct characters/digits, with 1500 training instances and 300 test examples per class; representative images are shown in Fig. 1. Each character is a 28x28 pixel image, giving a 784-dimensional input space. The MNIST database is well studied and has had many algorithms applied to it for digit classification, so it serves as a baseline for our implementations. The notMNIST database appears to be a harder task, and we want to understand why that is so and where our algorithms fail.

Algorithms Implemented

kNN

We implemented k-nearest neighbors, where the classification of a point depends only on the k closest training points in Euclidean distance. Among the values of k we tried, small values gave the best results; results are shown for k=3.
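The nearest-neighbor rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function name `knn_classify` and the two-dimensional toy points are hypothetical, standing in for the 784-dimensional pixel vectors.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (sorted by distance only).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote; ties go to the label seen first, i.e. the nearer one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters in the plane.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

label_near_origin = knn_classify(train_points, train_labels, (0.5, 0.5), k=3)
label_far_corner = knn_classify(train_points, train_labels, (5.5, 5.5), k=3)
```

Because kNN has no training phase, every test image is compared against all stored training images, so the cost of classifying grows with the size of the training set.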
C-SVM with 4 degree polynomial kernel

With the help of LIBSVM, we implemented C-SVM, which attempts to solve the following optimization problem:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i$$

s.t.

$$y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, m,$$

where $(x_i, y_i)$ are the $m$ training examples with labels $y_i \in \{-1, +1\}$ and $\xi_i$ are slack variables, utilizing a 4 degree polynomial kernel

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \left( \gamma \, x_i^T x_j + r \right)^4.$$

Fig. 1. Images of representative digits and characters from the MNIST (left) and notMNIST (right) datasets.

Once the optimum margin classifier is known, binary classification can be performed. Since we are classifying a set of 10 digits/characters, we separately implemented one vs. one classification, where the training set of each character is trained against the training set of every other character. During testing, a character is classified by every pairwise optimum margin classifier, and the class predicted most often is taken as its classification. Values for C and $\gamma$ were chosen to give low training error.

C-SVM with linear kernel

Using LIBLINEAR, we also implemented an SVM with a linear kernel, very similar to the above. In this case, however, the kernel was a linear function, and the C term in the above objective was $C \sum_{i} \xi_i^2$, i.e. an L2 rather than L1 penalty on the slack variables. Again, we used the one vs. one method to build a multi-class classifier.

L1 Regularized logistic regression

We used LIBLINEAR to implement logistic regression with L2 regularization, in which a penalty term on the weight vector $w$ is included in the maximized likelihood. We would like to maximize the penalized log-likelihood w.r.t. $w$, where the predicted probability is $g(w^T x)$ and $g$ is the sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

This is once again a binary classifier, and we implemented a one vs. one extension to classify our 10 digits/characters.

It should be pointed out that the first two techniques lead to non-linear classifiers (the 4th degree polynomial kernel maps to a high dimensional space, where it fits a linear classifier, but that boundary will not be linear in the original space), while the decision boundaries of the second two are hyperplanes in the original space of the characters. The set of handwritten characters lives in an extremely high dimensional space (784 pixels), but its intrinsic dimension is without a doubt much lower.
In addition, this space is highly nonlinear: one might imagine that the space of characters is a k-dimensional manifold mapped into $\mathbb{R}^n$, where n in this case is 784 and k is much less than n.

Results of Classifiers

The results of our 4 techniques are shown below in Table 2, with our polynomial SVM outperforming the other techniques by a reasonable margin. kNN performed second best, which may be rather surprising given how unsophisticated the technique is; it is interesting, however, that the best techniques are the ones whose classifier boundary is not linear in the original space of the characters. This might be expected.

| Algorithm | Classification Error (MNIST) | Classification Error (notMNIST) |
|---|---|---|
| Logistic Regression | 9.6% | 10.6% |
| Linear SVM | 11.1% | 14.6% |
| SVM 4 poly | 4.1% | 7.7% |
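The one vs. one scheme used above to turn the binary SVM and logistic regression classifiers into 10-class classifiers can be sketched as follows. This is a hedged illustration only: the helper names `make_pairwise_classifiers` and `one_vs_one_predict` are hypothetical, and each pairwise classifier here is a toy nearest-prototype rule on one-dimensional inputs, standing in for a trained binary model.

```python
from collections import Counter
from itertools import combinations


def make_pairwise_classifiers(centers):
    """Build one binary classifier per unordered pair of classes.

    `centers` maps each class label to a 1-D prototype value. Each pairwise
    classifier is a toy nearest-prototype rule standing in for a trained
    binary SVM or logistic regression model (an assumption of this sketch).
    """
    classifiers = {}
    for a, b in combinations(sorted(centers), 2):
        # Default arguments bind the current pair to this closure.
        def classify(x, a=a, b=b):
            return a if abs(x - centers[a]) <= abs(x - centers[b]) else b
        classifiers[(a, b)] = classify
    return classifiers


def one_vs_one_predict(classifiers, x):
    """Run every pairwise classifier on x and return the most-voted class."""
    votes = Counter(classify(x) for classify in classifiers.values())
    return votes.most_common(1)[0][0]


# Toy example (hypothetical): three classes with prototypes at 0, 5 and 10.
centers = {0: 0.0, 1: 5.0, 2: 10.0}
classifiers = make_pairwise_classifiers(centers)
prediction = one_vs_one_predict(classifiers, 4.2)

# Ten classes require 10 * 9 / 2 = 45 pairwise classifiers.
num_pairs_for_ten_classes = len(list(combinations(range(10), 2)))
```

In the toy example the input 4.2 receives two votes for class 1 and one vote for class 0, so class 1 is returned. With 10 digits/characters the same voting runs over 45 pairwise classifiers.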
Similar resources
A Modfied Self-organizing Map Neural Network to Recognize Multi-font Printed Persian Numerals (RESEARCH NOTE)
This paper proposes a new method to distinguish the printed digits, regardless of font and size, using neural networks. Unlike our proposed method, existing neural network based techniques are only able to recognize the trained fonts. These methods need a large database containing digits in various fonts. New fonts are often introduced to the public, which may not be truly recognized by the Opti...
Gujarati handwritten numeral optical character reorganization through neural network
This paper deals with an optical character recognition (OCR) system for handwritten Gujarati numbers. One may find so much of work for Indian languages like Hindi, Kannada, Tamil, Bangala, Malayalam, Gurumukhi etc, but Gujarati is a language for which hardly any work is traceable especially for handwritten characters. Here in this work a neural network is proposed for Gujarati handwritten digit...
Persian Handwritten Digit Recognition Using Particle Swarm Probabilistic Neural Network
Handwritten digit recognition can be categorized as a classification problem. Probabilistic Neural Network (PNN) is one of the most effective and useful classifiers, which works based on Bayesian rule. In this paper, in order to recognize Persian (Farsi) handwritten digit recognition, a combination of intelligent clustering method and PNN has been utilized. Hoda database, which includes 80000 P...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
In order to facilitate the entry of data into the computer and its digitalization, automatic recognition of printed texts and manuscripts is one of the considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
A Methodology for Handwritten Character Recognition Using SVM
This paper discusses a methodology for handwritten character recognition applying feature subset selection to reduce number of features. Its novelty lies in the use of a genetic algorithm for the preparation of input data for a support vector machine which is employed to recognize the handwritten Persian digits in particular. Comprehensive experiments on handwritten Persian digits demonstrate t...